This report analyzes in detail the chemical composition of white wines and its effect on quality.
## [1] 4898 13
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
There are a total of 4898 wine samples.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Most of the wines in this data set have a quality score of 5, 6 or 7. No wine has a quality score below 3 or above 9.
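As a rough illustration of how concentrated the scores are, the counts in the table above can be turned into percentages. This is only a sketch: the counts are copied by hand from the output, and the names `counts`, `share` and `mid_share` are introduced here for illustration.

```r
# Quality counts copied from the frequency table above
counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
            `7` = 880, `8` = 175, `9` = 5)

# Percentage of wines at each quality score
share <- round(100 * counts / sum(counts), 1)
print(share)

# Fraction of wines scoring 5, 6 or 7
mid_share <- sum(counts[c("5", "6", "7")]) / sum(counts)
print(mid_share)
```

With these counts, scores 5 to 7 cover roughly 93 percent of the samples, which is why the extreme quality levels are sparsely populated in the later plots.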
Fixed acidity is approximately normally distributed.
Volatile acidity is approximately normally distributed.
Citric acid is approximately normally distributed.
Chlorides are approximately normally distributed.
pH is approximately normally distributed.
Sulphates are approximately normally distributed.
Density is approximately normally distributed.
Residual sugar has been log transformed to reveal the different modes present in it. Its distribution is bimodal: most wines fall around two sugar values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
The features free sulfur dioxide and total sulfur dioxide have outliers. For free sulfur dioxide the median is 34 and the 3rd quartile is 46, but the maximum is 289. Similarly, for total sulfur dioxide the median is 134 and the 3rd quartile is 167, but the maximum is 440.
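One common way to quantify "outlier" is Tukey's rule, which flags values more than 1.5 times the interquartile range above the 3rd quartile. The sketch below applies that rule to the quartiles printed above. The rule and the helper `upper_fence` are assumptions chosen for illustration, not necessarily the criterion used in this report.

```r
# Upper Tukey fence: Q3 + k * IQR (k = 1.5 is the conventional choice)
upper_fence <- function(q1, q3, k = 1.5) q3 + k * (q3 - q1)

# Quartiles taken from the summaries above
free_fence  <- upper_fence(q1 = 23,  q3 = 46)   # free sulfur dioxide
total_fence <- upper_fence(q1 = 108, q3 = 167)  # total sulfur dioxide

print(c(free = free_fence, total = total_fence))

# The reported maxima lie well beyond both fences
print(c(free = 289 > free_fence, total = 440 > total_fence))
```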
Alcohol seems to have a roughly uniform distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1225 2450 2450 3674 4898
X is the row index, running from 1 to 4898, so its histogram carries little information without a transformation. After a log transformation, X has a left-skewed (negatively skewed) distribution.
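Because the summary shows X running from 1 to 4898 with mean and median both at 2450, the direction of the skew after a log transformation can be checked numerically. The sketch below assumes X is exactly the sequence `1:4898` and uses a hand-written moment-based `skewness` helper, which is not part of base R.

```r
# Assumption: X is the row index 1, 2, ..., 4898
x <- 1:4898

# Moment-based skewness: third central moment over sd cubed
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(x)         # symmetric sequence, so essentially zero
skew_log <- skewness(log10(x))  # sign gives the direction of the skew

print(c(raw = skew_raw, log10 = skew_log))
```

A negative value indicates a long tail on the left, and a positive value a long tail on the right.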
The data set has 4898 wines with 13 features (X, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). The following observations were made about the data:

* Most of the wines have a quality score of 5, 6 or 7.
* No wine in the data set has a quality score below 3 or above 9.
* Alcohol is roughly uniformly distributed.
* X is negatively skewed.
* Residual sugar is bimodal after the log transformation.
* Free sulfur dioxide and total sulfur dioxide have outliers in the upper tail.
* The remaining features (fixed acidity, volatile acidity, citric acid, chlorides, density, pH, sulphates) are approximately normally distributed.
Quality is the main feature of interest; I will find out how the other features influence it. I strongly suspect that residual sugar has some relationship with the quality of the wine. It also makes sense for alcohol content to have some relationship with wine quality. From this univariate analysis alone it is very difficult to establish any relationship between quality and the other features.
The histogram of X has an unusual shape with the default settings, so I applied a log transform to the data, which gives a left-skewed distribution with values centered near 2500. Residual sugar was also log transformed, turning a right-skewed distribution into a bimodal one, with most wines falling around 3 or 9. The bin widths were adjusted in all the histograms.
##
## Two-Step Estimates
##
## Correlations/Type of Correlation:
## X fixed.acidity volatile.acidity citric.acid
## X 1 Pearson Pearson Pearson
## fixed.acidity -0.2558 1 Pearson Pearson
## volatile.acidity 0.002858 -0.0227 1 Pearson
## citric.acid -0.1499 0.2892 -0.1495 1
## residual.sugar 0.006624 0.08902 0.06429 0.09421
## chlorides -0.04565 0.02309 0.07051 0.1144
## free.sulfur.dioxide -0.01193 -0.0494 -0.09701 0.09408
## total.sulfur.dioxide -0.162 0.09107 0.08926 0.1211
## density -0.186 0.2653 0.02711 0.1495
## pH -0.1158 -0.4259 -0.03192 -0.1637
## sulphates 0.009808 -0.01714 -0.03573 0.06233
## alcohol 0.2137 -0.1209 0.06772 -0.07573
## quality 0.03576 -0.1137 -0.1947 -0.009209
## residual.sugar chlorides free.sulfur.dioxide
## X Pearson Pearson Pearson
## fixed.acidity Pearson Pearson Pearson
## volatile.acidity Pearson Pearson Pearson
## citric.acid Pearson Pearson Pearson
## residual.sugar 1 Pearson Pearson
## chlorides 0.08868 1 Pearson
## free.sulfur.dioxide 0.2991 0.1014 1
## total.sulfur.dioxide 0.4014 0.1989 0.6155
## density 0.839 0.2572 0.2942
## pH -0.1941 -0.09044 -0.0006178
## sulphates -0.02666 0.01676 0.05922
## alcohol -0.4506 -0.3602 -0.2501
## quality -0.09758 -0.2099 0.008158
## total.sulfur.dioxide density pH sulphates
## X Pearson Pearson Pearson Pearson
## fixed.acidity Pearson Pearson Pearson Pearson
## volatile.acidity Pearson Pearson Pearson Pearson
## citric.acid Pearson Pearson Pearson Pearson
## residual.sugar Pearson Pearson Pearson Pearson
## chlorides Pearson Pearson Pearson Pearson
## free.sulfur.dioxide Pearson Pearson Pearson Pearson
## total.sulfur.dioxide 1 Pearson Pearson Pearson
## density 0.5299 1 Pearson Pearson
## pH 0.002321 -0.09359 1 Pearson
## sulphates 0.1346 0.07449 0.156 1
## alcohol -0.4489 -0.7801 0.1214 -0.01743
## quality -0.1747 -0.3071 0.09943 0.05368
## alcohol quality
## X Pearson Pearson
## fixed.acidity Pearson Pearson
## volatile.acidity Pearson Pearson
## citric.acid Pearson Pearson
## residual.sugar Pearson Pearson
## chlorides Pearson Pearson
## free.sulfur.dioxide Pearson Pearson
## total.sulfur.dioxide Pearson Pearson
## density Pearson Pearson
## pH Pearson Pearson
## sulphates Pearson Pearson
## alcohol 1 Pearson
## quality 0.4356 1
##
## Standard Errors:
## X fixed.acidity volatile.acidity citric.acid
## X
## fixed.acidity 0.01336
## volatile.acidity 0.01429 0.01428
## citric.acid 0.01397 0.0131 0.01397
## residual.sugar 0.01429 0.01418 0.01423 0.01416
## chlorides 0.01426 0.01428 0.01422 0.0141
## free.sulfur.dioxide 0.01429 0.01426 0.01416 0.01416
## total.sulfur.dioxide 0.01392 0.01417 0.01418 0.01408
## density 0.0138 0.01328 0.01428 0.01397
## pH 0.0141 0.0117 0.01428 0.01391
## sulphates 0.01429 0.01429 0.01427 0.01423
## alcohol 0.01364 0.01408 0.01422 0.01421
## quality 0.01427 0.01411 0.01375 0.01429
## residual.sugar chlorides free.sulfur.dioxide
## X
## fixed.acidity
## volatile.acidity
## citric.acid
## residual.sugar
## chlorides 0.01418
## free.sulfur.dioxide 0.01301 0.01414
## total.sulfur.dioxide 0.01199 0.01372 0.008878
## density 0.004233 0.01334 0.01305
## pH 0.01375 0.01417 0.01429
## sulphates 0.01428 0.01429 0.01424
## alcohol 0.01139 0.01244 0.0134
## quality 0.01415 0.01366 0.01429
## total.sulfur.dioxide density pH sulphates
## X
## fixed.acidity
## volatile.acidity
## citric.acid
## residual.sugar
## chlorides
## free.sulfur.dioxide
## total.sulfur.dioxide
## density 0.01028
## pH 0.01429 0.01416
## sulphates 0.01403 0.01421 0.01394
## alcohol 0.01141 0.005594 0.01408 0.01429
## quality 0.01385 0.01294 0.01415 0.01425
## alcohol
## X
## fixed.acidity
## volatile.acidity
## citric.acid
## residual.sugar
## chlorides
## free.sulfur.dioxide
## total.sulfur.dioxide
## density
## pH
## sulphates
## alcohol
## quality 0.01158
##
## n = 4898
##
## P-values for Tests of Bivariate Normality:
## X fixed.acidity volatile.acidity citric.acid
## X
## fixed.acidity 1.384e-135
## volatile.acidity 4.43e-79 8.326e-51
## citric.acid 8.099e-177 7.094e-126 3.11e-162
## residual.sugar 1.269e-153 3.961e-142 3.871e-146 6.704e-208
## chlorides 0 0 0 0
## free.sulfur.dioxide 2.436e-59 9.489e-44 2.307e-50 1.481e-110
## total.sulfur.dioxide 4.165e-65 1.731e-38 3.649e-49 2.145e-108
## density 6.906e-101 2.053e-49 1.458e-45 1.894e-132
## pH 2.823e-57 5.114e-36 2.379e-36 3.439e-101
## sulphates 1.308e-56 1.076e-33 4.068e-33 4.195e-103
## alcohol 3.053e-105 1.172e-74 1.458e-96 6.265e-186
## quality 0 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## X
## fixed.acidity
## volatile.acidity
## citric.acid
## residual.sugar
## chlorides 0
## free.sulfur.dioxide 2.279e-119 0
## total.sulfur.dioxide 9.659e-122 0 2.231e-30
## density 3.89e-196 0 1.384e-52
## pH 1.085e-119 0 3.012e-24
## sulphates 2.257e-116 0 1.06e-18
## alcohol 3.624e-202 0 9.643e-71
## quality 0 0 0
## total.sulfur.dioxide density pH sulphates
## X
## fixed.acidity
## volatile.acidity
## citric.acid
## residual.sugar
## chlorides
## free.sulfur.dioxide
## total.sulfur.dioxide
## density 1.193e-28
## pH 3.591e-17 1.448e-34
## sulphates 6.053e-32 1.796e-35 1.473e-17
## alcohol 2.343e-57 3.223e-108 2.598e-62 3.961e-84
## quality 0 0 0 0
## alcohol
## X
## fixed.acidity
## volatile.acidity
## citric.acid
## residual.sugar
## chlorides
## free.sulfur.dioxide
## total.sulfur.dioxide
## density
## pH
## sulphates
## alcohol
## quality 0
The Pearson correlations between quality and the other features are quite low. Alcohol has the highest correlation with quality, at 0.4356. Other features that might influence quality are fixed acidity, volatile acidity, residual sugar, chlorides, total sulfur dioxide, density and sulphates.
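The standard errors printed in the correlation output appear consistent with the usual large-sample approximation for a Pearson correlation, SE ≈ (1 − r²)/√n. The sketch below checks this against two reported entries. The formula is an assumption about how the table was produced, and `pearson_se` is a helper defined here for illustration.

```r
n <- 4898

# Large-sample standard error of a Pearson correlation
pearson_se <- function(r, n) (1 - r^2) / sqrt(n)

# Correlations copied from the matrix above
se_alcohol_quality <- pearson_se(0.4356, n)  # table reports 0.01158
se_sugar_density   <- pearson_se(0.839, n)   # table reports 0.004233

print(c(alcohol_quality = se_alcohol_quality,
        sugar_density   = se_sugar_density))
```

With n close to 5000 every standard error is below 0.015, so even correlations as small as 0.05 are several standard errors away from zero. Statistical significance therefore says little here about practical importance.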
##
## Pearson's product-moment correlation
##
## data: wine$alcohol and wine$quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
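The t statistic and confidence interval in this output follow directly from r and n. As a sketch, the standard formulas t = r·√(n − 2)/√(1 − r²) and the Fisher z-transform interval reproduce the printed values. The variable names below are introduced for illustration.

```r
r <- 0.4355747  # sample correlation from the output above
n <- 4898       # number of wines

# t statistic on n - 2 degrees of freedom
df <- n - 2
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transform
z    <- atanh(r)
half <- qnorm(0.975) / sqrt(n - 3)
ci   <- tanh(c(z - half, z + half))

print(t_stat)
print(ci)
```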
From the scatter plot of alcohol against quality we can see that wines of quality 5, 6 and 7 span alcohol contents from 8 percent to 13 percent. Lower-quality wines, with scores of 5 and below, have alcohol content predominantly in the range of 8 to 11 percent.
##
## Pearson's product-moment correlation
##
## data: wine$alcohol and wine$density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
Density and alcohol are negatively correlated, with a Pearson correlation of -0.7801. Wines with higher alcohol content have lower density.
##
## Pearson's product-moment correlation
##
## data: wine$residual.sugar and wine$density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8304732 0.8470698
## sample estimates:
## cor
## 0.8389665
Residual sugar and density are positively correlated, with a correlation of 0.839.
As expected, the relationship between density and quality is the opposite of that between alcohol and quality. For instance, lower-quality wines predominantly have higher density, and higher-quality wines lower density.
It is difficult to establish any relationship between fixed acidity and quality. Volatile acidity has a lot of outliers; even after removing them I cannot establish any relationship between quality and volatile acidity. All quality levels have similar distributions of volatile acidity and fixed acidity.
Residual sugar has a lot of outliers, so only values up to the 99th percentile are included in the analysis. From the scatter plot we can infer that wines of quality 4 or below have predominantly lower residual sugar. Wines of quality 5 and above have residual sugar distributions similar to one another. This is a surprise, as I was expecting a pattern similar to that of density and quality, but it was similar to that of alcohol and quality.
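Trimming at the 99th percentile can be sketched as follows. The vector `x` is synthetic and exists only to make the example runnable. On the real data the same filter would be applied to the residual sugar column, for instance `wine$residual.sugar` if the data frame is named `wine` as in the test output above.

```r
# Synthetic values for illustration only (not the wine data):
# 99 ordinary values plus one extreme value
x <- c(1:99, 1000)

# 99th percentile (R's default quantile type 7)
cutoff <- quantile(x, probs = 0.99)

# Keep only observations at or below the cutoff
kept <- x[x <= cutoff]

print(cutoff)
print(length(kept))
```

Here the single extreme value is dropped and the other 99 observations are kept.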
Chlorides have some outliers, so only values up to the 99th percentile are considered. Wines of middle quality show a wide range of chloride values, whereas lower- and higher-quality wines have lower chloride values. This may be because the middle quality levels have more samples, and hence the variance is quite high.
No relationship could be established between quality and total sulfur dioxide.
Wines of different quality have similar distributions of sulphates.
Wines of quality 6 and above have a higher median alcohol value, and the median alcohol value increases with quality from 6 upward. As expected, we see the exact opposite trend with density.
I could not find any definite pattern in the boxplots of quality against residual sugar, total sulfur dioxide or sulphates.
Lower-quality wines have a higher median chloride content than higher-quality wines.
There is a relationship between quality and alcohol, with a correlation coefficient of 0.4356: the higher the quality of the wine, the higher its alcohol content. Wines of quality 7, 8 and 9 have a higher median alcohol content than lower-quality wines.
The next feature that influences wine quality is density, with a correlation of -0.3071. Density is a physical property determined by the other chemical components of the wine; in our case it is affected by alcohol and residual sugar. I strongly suspect that density does not affect wine quality much on its own, since density is itself driven by the presence of other chemicals.
Chlorides are negatively correlated with wine quality, with a correlation coefficient of -0.2099. Wines of quality 7, 8 and 9 have a lower median chloride content than lower-quality wines.
The other features that seem to have an effect on wine quality are fixed acidity, volatile acidity, residual sugar, total sulfur dioxide and sulphates. Further analysis is needed to determine the relationship between these features and wine quality.
Density correlates strongly with residual sugar. The correlation coefficient between density and residual sugar is 0.839.
There is a strong negative correlation between alcohol and density (-0.7801): the higher the percentage of alcohol, the lower the density.
Among the features, alcohol has the strongest correlation with quality, a moderate positive one (0.4356). Density has a weaker negative correlation with quality (-0.3071), and chlorides are another feature with a negative correlation (-0.2099).
##
## high low medium
## 1060 1640 2198
The wines are divided into three categories: low, medium and high. A quality score below 6 is categorized as low, a score of 6 as medium, and a score above 6 as high.
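A minimal sketch of this binning uses `cut()` on a quality vector rebuilt from the frequency table shown earlier. The names `quality` and `rating` are assumptions for illustration. The original output lists the categories alphabetically (high, low, medium), which suggests a different construction, but the counts are the same.

```r
# Rebuild the quality scores from the counts reported earlier
counts  <- c(20, 163, 1457, 2198, 880, 175, 5)
quality <- rep(3:9, times = counts)

# Intervals are right-closed by default: (-Inf, 5], (5, 6], (6, Inf)
rating <- cut(quality,
              breaks = c(-Inf, 5, 6, Inf),
              labels = c("low", "medium", "high"))

print(table(rating))
```

This reproduces the 1640 low, 2198 medium and 1060 high wines in the table above.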
Low-quality wines mostly have alcohol values of 11 or lower and volatile acidity in the range of 0.2 to 0.6. Medium-quality wines predominantly have alcohol content of 11 or lower and volatile acidity in the range of 0.1 to 0.5. High-quality wines predominantly have alcohol content of 11 or higher and volatile acidity in the range of 0.1 to 0.5. This behavior is quite expected, given the positive correlation between alcohol and wine quality and the negative correlation between volatile acidity and wine quality.
Low- and medium-quality wines mostly have fixed acidity values from 5 to 8.5 and alcohol content below 11, whereas high-quality wines mostly have fixed acidity values from 5 to 7.5 and alcohol content above 11. This is consistent with our correlation analysis.
Low- and medium-quality wines mostly have residual sugar values from 2 to 20 and alcohol content below 11, whereas high-quality wines mostly have residual sugar values from 2 to 13 and alcohol content above 11. This is consistent with our correlation analysis.
Low- and medium-quality wines mostly have free sulfur dioxide values from 10 to 60 and alcohol content below 11, whereas high-quality wines mostly have free sulfur dioxide values from 25 to 50 and alcohol content above 11. This explains the weak correlation between wine quality and free sulfur dioxide.
Low- and medium-quality wines mostly have sulphates values from 0.3 to 0.6 and alcohol content below 11, whereas high-quality wines mostly have sulphates values from 0.25 to 0.7 and alcohol content above 11. This is consistent with our correlation analysis.